MoQE: Improve Quantization Model performance via Mixture of Quantization Experts

Zhang, Jinhao, Zhang, Yunquan, Zhang, Boyang, Liu, Zeyu, Cheng, Daning

arXiv.org Artificial Intelligence

Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture that aims to improve the performance of quantized models. MoQE combines multiple quantized variants of one full-precision model as specialized "quantization experts" and dynamically routes each input to the most suitable expert based on its characteristics. Through this specialization, MoQE alleviates the performance degradation commonly seen in single quantized models. We design lightweight, structure-aware router models tailored to both CV and NLP tasks. Experimental evaluations on the ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to SOTA quantized models without incurring significant increases in inference latency.

Quantization plays a pivotal role in machine learning, particularly in enhancing model efficiency and reducing resource consumption. As deep learning models grow increasingly complex, their demand for computational resources escalates, constraining deployment on resource-limited devices and increasing operational costs. Furthermore, quantization streamlines the model optimization pipeline, enabling developers to achieve efficient deployment within shorter timeframes and accelerating time-to-market for AI-driven products. Consequently, quantization serves not only as a critical enabler for improving the accessibility and practicality of machine learning models but also as a key facilitator in the broader dissemination of artificial intelligence technologies. However, quantization methods face several critical challenges in practical deployment.
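The abstract's core idea, several quantized copies of one model acting as experts behind a lightweight router, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, the "router" here is a hand-written heuristic rather than the learned, structure-aware router the paper describes, and quantization is simplified to symmetric uniform rounding of a single weight matrix.

```python
# Hypothetical sketch of MoQE-style inference (names are illustrative,
# not taken from the paper's code).
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization of a weight matrix to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale

class MoQE:
    """Route each input to one of several quantized copies of a linear layer."""
    def __init__(self, w, bit_widths=(2, 4, 8)):
        # Each "expert" is the same full-precision weight, quantized differently.
        self.experts = [quantize(w, b) for b in bit_widths]

    def route(self, x):
        # Toy stand-in for the learned router: inputs with a large dynamic
        # range are sent to the highest-precision expert.
        spread = np.max(x) - np.min(x)
        return len(self.experts) - 1 if spread > 2.0 else 0

    def forward(self, x):
        w_q = self.experts[self.route(x)]
        return x @ w_q

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))
model = MoQE(w)
y = model.forward(rng.normal(size=(8,)))
```

Because every expert shares the same architecture and only the weights differ, routing adds a single small forward pass before dispatch, which is consistent with the paper's claim of no significant increase in inference latency.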